Project 2

Student: Kesav Adithya Venkidusamy
Course: DSC680 - Applied Data Science
Instructor: Professor Catherine Williams
Assignment: Project 2

Life Expectancy Prediction

Idea: Everything has an expiration date, and humans are no exception. The term “life expectancy” refers to the number of years a person can expect to live. By definition, it is an estimate of the average age that members of a particular population group will be when they die. We are in an unprecedented era in which humans are living longer thanks to increased access to modern science and healthcare. Even so, life expectancy varies widely across the globe. It depends on several factors, the two most important being gender and birth year. Generally, females have a slightly higher life expectancy than males due to biological differences. Other factors also influence life expectancy.

In this project, I aim to explore the parameters affecting the life span of individuals living in distinct countries and learn how the life span can be estimated with the help of machine learning models. I will also focus on exploring the parameters that greatly impact the life span of an individual.

Dataset: The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status and many other related factors for all countries. The datasets are made available to the public for health data analysis.

https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

The dataset on life expectancy and health factors for 193 countries was collected from the same WHO data repository website, and the corresponding economic data was collected from the United Nations website.

Abstract: Predict the key drivers for the Life Expectancy

Features and Target present in the dataset

  1. Country - Country Observed
  2. Year - Year Observed
  3. Status - Status of the country; Developed or Developing Status
  4. Adult Mortality - Adult Mortality Rates on both sexes (probability of dying between 15-60 years/1000 population).
  5. Infant deaths - Number of Infant Deaths per 1000 population
  6. Alcohol - Alcohol recorded per capita (15+) consumption (in liters of pure alcohol).
  7. Percentage expenditure - Expenditure on health as a percentage of Gross Domestic Product per capita (%).
  8. Hepatitis B - Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
  9. Measles - Number of reported Measles cases per 1000 population
  10. BMI - Average Body Mass Index of the entire population
  11. Under-five-deaths - Number of under-five deaths per 1000 population
  12. Polio - Polio (Pol3) immunization coverage among 1-year-olds (%)
  13. Total expenditure - General government expenditure on health as a percentage of total government expenditure (%)
  14. Diphtheria - Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
  15. HIV/AIDS - Deaths per 1 000 live births HIV/AIDS (0-4 years)
  16. GDP - Gross Domestic Product per capita (in USD)
  17. Population - The population of the country
  18. thinness 1-19 years - Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
  19. thinness 5-9 years - Prevalence of thinness among children for Age 5 to 9 (%)
  20. Income composition of resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
  21. Schooling - Number of years of Schooling (years)

Target:

Life expectancy - Life expectancy in years

Data Exploration

Importing libraries for data processing
Source Data Analysis

The dataset contains 2938 rows and 22 attributes!

Observation

One notable issue in this dataset is that the column names are inconsistently formatted: some have a space before the name, some after it, and some on both sides, as in " BMI ". To tackle this, let's print the names of all the columns, strip the surrounding spaces, and replace the remaining spaces with underscores.
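The renaming step can be sketched as follows. This is a minimal illustration on a tiny hypothetical dataframe, not the notebook's original code. The sample column names and the choice to also replace hyphens with underscores are assumptions, made so the result matches names such as "thinness_1_19_years" used later in this report.

```python
import pandas as pd

# Hypothetical sample with the kind of messy column names described above.
df = pd.DataFrame(
    [[19.1, 65.0, 17.2, 83]],
    columns=[" BMI ", "Life expectancy ", "thinness  1-19 years", "under-five deaths "],
)

print(list(df.columns))  # inspect the raw names first

# Strip leading/trailing whitespace, then collapse every run of
# spaces or hyphens into a single underscore (assumed convention).
df.columns = (
    df.columns
    .str.strip()
    .str.replace(r"[\s\-]+", "_", regex=True)
)

print(list(df.columns))
```

Cleaning all names in one pass keeps later code free of hard-to-spot errors caused by a stray leading or trailing space.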

Observation

The "Life_expectancy" column is our target variable. It is continuous, with values ranging from 36.3 to 89.

EDA

Handling Null Values

Missing values can be a major pain: almost all algorithms expect complete data and return an error if some points are missing. So, cleaning the data is essential!

Observation

From the above bar chart, I can see that many attributes have missing values. For example, Hepatitis B has only 2385 values, whereas the expected number of values for every attribute is 2938. We have to fill in these missing values because they may cause problems for our algorithm. We'll use pd.DataFrame.fillna to impute each missing field with the previous value in the column (forward fill), which can be a reasonable way to fill in such entries.
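A minimal sketch of this imputation step, assuming a small hypothetical dataframe in place of the real data. It uses DataFrame.ffill(), which has the same effect as forward fill through fillna; the sample values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real dataset has 2938 rows.
df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2013, 2014, 2015, 2013, 2014, 2015],
    "Hepatitis_B": [90.0, np.nan, 92.0, 75.0, np.nan, np.nan],
    "GDP": [500.0, 520.0, np.nan, 3000.0, 3100.0, 3200.0],
})

# Count missing values per column before imputation.
missing_before = df.isnull().sum()
print(missing_before)

# Forward fill: each missing entry takes the previous non-missing value
# in the same column.
df_filled = df.ffill()

missing_after = df_filled.isnull().sum()
print(missing_after)
```

Two caveats apply to forward fill in general: a missing value in the very first row has no previous value and stays missing, and a fill can carry a value across a country boundary unless the fill is done per country.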

Duplicate check

Observation

There are no duplicate rows in the dataframe.
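A duplicate check of this kind can be sketched with DataFrame.duplicated(), shown here on a small hypothetical dataframe rather than the real data.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2014, 2015],
    "Life_expectancy": [70.1, 70.4, 81.0, 81.3],
})

# duplicated() marks every row that repeats an earlier row exactly.
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")
```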

Visualizations

Numerical Variable Exploration

Histogram

Observation
Right Skewed

From the histogram chart, we see that all the features below are right skewed, as they have a “tail” on the right side of the distribution. The frequency of values is high at the beginning and low towards the end.

Left Skewed

From the histogram chart, we see that all the features below are left skewed, as they have a “tail” on the left side of the distribution. The frequency of values is low at the beginning and high at the end.

Normal Distribution

A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. The features below have an approximately normal distribution.

Multimodal Distribution

A multimodal distribution is a probability distribution with more than one peak, or “mode.” A bimodal distribution is also multimodal, as it has multiple peaks. The feature below has a multimodal distribution.
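The skew categories described in this section can also be checked numerically instead of only by eye. The sketch below uses synthetic columns (not the real features) and DataFrame.skew(). The 0.5 cutoff is a rule-of-thumb assumption, and the helper describe_skew is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic columns with known shapes, for illustration only.
df = pd.DataFrame({
    "right_tail": rng.exponential(scale=1.0, size=n),   # long tail on the right
    "left_tail": -rng.exponential(scale=1.0, size=n),   # long tail on the left
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=n),
})

def describe_skew(value, threshold=0.5):
    """Label a skewness value; the threshold is an assumed rule of thumb."""
    if value > threshold:
        return "right skewed"
    if value < -threshold:
        return "left skewed"
    return "roughly symmetric"

skewness = df.skew()
labels = skewness.apply(describe_skew)
print(pd.DataFrame({"skew": skewness, "label": labels}))
```

Skewness alone does not reveal multiple peaks, so a multimodal feature still needs to be identified from its histogram.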

Categorical Variable Exploration

Comparing the life expectancy of Developing and Developed Countries using violin Chart
Observation

From the above graph, we can see that developing countries have lower life expectancy than developed countries.

Country Wise Life Expectancy over the years using country and line plot
Observation

Looking at the chart, we see that life expectancy is higher for developed countries than for developing countries. We can also see that life expectancy increases over the years across countries.

Observation

As in the country chart, this chart shows that life expectancy increases over the years. The increase is particularly noticeable for developing countries compared to developed countries.

Target Variable Analysis

Distribution Plot
Observation:

The distribution plot shows that life expectancy peaks around 72. This covers all countries and all years.

Bar Chart
Observation

Life expectancy is higher for developed countries than for developing countries.
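The comparison behind such a bar chart can be sketched as a group-wise mean. The rows below are hypothetical values chosen for illustration, not figures from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; values are invented for illustration.
df = pd.DataFrame({
    "Status": ["Developed", "Developed", "Developing", "Developing"],
    "Life_expectancy": [80.0, 82.0, 60.0, 70.0],
})

# Average life expectancy per development status.
mean_by_status = df.groupby("Status")["Life_expectancy"].mean()
print(mean_by_status)
```

Plotting mean_by_status as a bar chart then shows the gap between the two groups directly.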

Life Expectancy vs features using Scatter Plot

Observation
Regression Plot
Observation

Life expectancy increases with BMI, GDP, and income composition of resources, and decreases as adult mortality and infant deaths increase. All of these relationships are expected.

Heat Map

Observation

Feature Engineering

Label Encoder

Observation
  1. The target variable "Life Expectancy" is positively correlated with "Schooling", "Income_composition_of_resources", "BMI", "Diphtheria", "Polio", "GDP", "percentage_expenditure", "Alcohol", "Hepatitis_B","Total_expenditure"

  2. The target variable is negatively correlated with "Adult Mortality", "HIV/AIDS", "thinness_1_19_years", "Status", "under_five_deaths", "thinness_5_9_years", "infant_deaths"

  3. These correlations make sense: life expectancy tends to be higher when people are educated, when vaccination coverage (Polio, Diphtheria, Hepatitis B) is high, and when the country's percentage_expenditure, Total_expenditure, and GDP are high. The negative correlations also make sense, because life expectancy is lower when:

    • HIV/AIDS deaths are high
    • Adult Mortality is high
    • Infant deaths and under_five_deaths are high
    • thinness_1_19_years and thinness_5_9_years are high
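The label encoding and correlation steps can be sketched as follows, assuming scikit-learn's LabelEncoder and a small hypothetical dataframe with invented values. LabelEncoder assigns integer codes in alphabetical order, so "Developed" becomes 0 and "Developing" becomes 1, which is why a negative correlation for "Status" means developing countries have lower life expectancy.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows; values are invented for illustration.
df = pd.DataFrame({
    "Status": ["Developed", "Developed", "Developing",
               "Developing", "Developing", "Developed"],
    "Schooling": [16.0, 15.0, 8.0, 10.0, 9.0, 17.0],
    "Adult_Mortality": [70, 80, 300, 250, 280, 60],
    "Life_expectancy": [81.0, 80.0, 58.0, 63.0, 60.0, 82.0],
})

# Encode the categorical Status column as integers (Developed=0, Developing=1).
encoder = LabelEncoder()
df["Status"] = encoder.fit_transform(df["Status"])

# Correlation of every feature with the target, sorted from negative to positive.
target_corr = (
    df.corr()["Life_expectancy"]
    .drop("Life_expectancy")
    .sort_values()
)
print(target_corr)
```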

Modeling

Linear Regression without Normalization

Running Linear Regression

The dataset has been successfully split into training and test sets. The division is purely random, so each time we run the code a new training set and test set will be created. Let's run the regressor on the training set and then measure its accuracy on the test set!

Without Normalization
Linear Regression

So, according to our current model, the R² score on the test set is 83.548%, which is a pretty good score!
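The split, fit, and score steps can be sketched as below. The data is synthetic: the two feature names and the coefficients used to generate the target are assumptions for illustration, so the score will not match the figure reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins for two features and the target (assumed relationship).
schooling = rng.uniform(0, 20, n)
adult_mortality = rng.uniform(50, 500, n)
life_expectancy = 55 + 1.2 * schooling - 0.03 * adult_mortality + rng.normal(0, 2, n)

X = np.column_stack([schooling, adult_mortality])
y = life_expectancy

# Hold out 20% of the rows for testing; random_state fixes the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2: share of the target's variance explained on the held-out rows.
score = r2_score(y_test, y_pred)
print(f"R^2 on the test set: {score:.3f}")
```

Passing random_state makes the split reproducible. Leaving it out gives a different split, and therefore a slightly different score, on every run.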

Plotting Actual vs Predicted

Linear Regression with MinMaxScaler Normalization

MinMaxScaler has been applied to the dataset, and the dataset has been successfully split into training and test sets. The division is purely random, so each time we run the code a new training set and test set will be created. Let's run the regressor on the training set and then measure its accuracy on the test set!

Linear Regression with Normalization

Observation:

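The root mean square error dropped from 14.236 to 0.00513 after normalizing the dataset. However, the R² score is the same (~83.548%) as for linear regression without normalization. This is expected: the scaler was applied to the whole dataset, so the error is now measured in the rescaled units of the target, while R² does not depend on the scale.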
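The effect of normalization on the two metrics can be reproduced on synthetic data. The sketch below assumes that both the features and the target are min-max scaled before the split, and it uses mean squared error as the error metric. Both choices are assumptions for illustration, as is the helper fit_and_score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic features and target, for illustration only.
X = rng.uniform(0, 100, size=(n, 3))
y = 40 + 0.3 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 3, n)

def fit_and_score(features, target):
    """Split with a fixed seed, fit linear regression, return (MSE, R^2)."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    return mean_squared_error(y_test, pred), r2_score(y_test, pred)

# Raw data.
mse_raw, r2_raw = fit_and_score(X, y)

# Min-max scale features AND target to [0, 1] (assumed to mirror scaling
# the whole dataset).
X_scaled = MinMaxScaler().fit_transform(X)
y_scaled = MinMaxScaler().fit_transform(y.reshape(-1, 1)).ravel()
mse_scaled, r2_scaled = fit_and_score(X_scaled, y_scaled)

y_range = y.max() - y.min()
print(f"raw:    MSE={mse_raw:.4f}  R^2={r2_raw:.4f}")
print(f"scaled: MSE={mse_scaled:.6f}  R^2={r2_scaled:.4f}")
print(f"raw MSE / range^2 = {mse_raw / y_range**2:.6f}")
```

The scaled error equals the raw error divided by the squared range of the target, so a smaller error after scaling says nothing about model quality. R² is the metric to compare across the two runs.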

Feature Selection

StandardScaler

Lasso Regression

Observation

The best features identified by Lasso Regression are listed below. They are the same as those identified by the correlation matrix.
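As a hedged illustration of how Lasso can serve as a feature selector, the sketch below standardizes synthetic features with StandardScaler and keeps the features whose Lasso coefficients are non-zero. The feature names x0 to x4, the generating coefficients, and alpha=0.2 are assumptions, not the settings used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic data: only x0 and x2 actually drive the target.
X = pd.DataFrame(
    rng.normal(size=(n, 5)),
    columns=["x0", "x1", "x2", "x3", "x4"],
)
y = 4.0 * X["x0"] - 3.0 * X["x2"] + rng.normal(scale=0.5, size=n)

# Standardize so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty shrinks weak coefficients to exactly zero.
lasso = Lasso(alpha=0.2).fit(X_scaled, y)
coefficients = pd.Series(lasso.coef_, index=X.columns)

# Features with a non-zero coefficient are the ones Lasso retains.
selected = coefficients[coefficients != 0].index.tolist()
print(coefficients)
print("Selected features:", selected)
```

A larger alpha removes more features and a smaller alpha keeps more, so in practice alpha is usually tuned, for example with cross-validation.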

OLS Regression (Ordinary Least Square)

Now, if we consider a good p-value to be one below 0.05 (the 5% significance level), then the features that we can consider the best are: